require(ggplot2)
require(dplyr)
require(car)
require(MASS)
require(gridExtra)
require(GGally)
require(qdata)
Janka Hardness is an important rating of Australian hardwood timbers. The test measures the force required to embed a steel ball into a piece of wood, and therefore gives a good indication of how the timber will withstand denting and wear. Janka hardness is strongly related to the density of the timber and can usually be modelled with a polynomial relationship.
The dataset consists of density and hardness measurements from 36 Australian Eucalypt hardwoods.
The variables in the data frame are Density and Hardness:
data(janka)
head(janka)
## Source: local data frame [6 x 2]
##
## Density Hardness
## (dbl) (int)
## 1 24.7 484
## 2 24.8 427
## 3 27.3 413
## 4 28.4 517
## 5 28.4 549
## 6 29.0 648
str(janka)
## Classes 'tbl_df', 'tbl' and 'data.frame': 36 obs. of 2 variables:
## $ Density : num 24.7 24.8 27.3 28.4 28.4 29 30.3 32.7 35.6 38.5 ...
## $ Hardness: int 484 427 413 517 549 648 587 704 979 914 ...
Let us begin with a plot showing the relation between Hardness and Density:
ggp <- ggplot(data = janka, mapping = aes(x = Density, y = Hardness)) +
geom_point(colour="darkblue", size=2) +
geom_smooth(method = "loess", colour="green", se = FALSE, span=0.8)
print(ggp)
Scatterplot of Hardness Vs. Density with loess line
And now a similar graph, but with the lm regression line added:
ggp <- ggplot(data=janka, mapping = aes(x=Density, y=Hardness)) +
geom_point(colour="darkblue", size=2) +
geom_smooth( method = "lm", colour="red", se=FALSE, size=1) +
geom_smooth( method = "loess", colour="green", se=FALSE, size=1, span=0.8)
print(ggp)
Scatterplot of Hardness Vs. Density with loess and lm linear fit
It seems that a simple linear model or, at most, a slightly curved model could be used to fit the data.
Let us start with a simple linear model:
fm1 <- lm(Hardness ~ Density, data = janka)
summary(fm1)
##
## Call:
## lm(formula = Hardness ~ Density, data = janka)
##
## Residuals:
## Min 1Q Median 3Q Max
## -338.40 -96.98 -15.71 92.71 625.06
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1160.500 108.580 -10.69 2.07e-12 ***
## Density 57.507 2.279 25.24 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 183.1 on 34 degrees of freedom
## Multiple R-squared: 0.9493, Adjusted R-squared: 0.9478
## F-statistic: 637 on 1 and 34 DF, p-value: < 2.2e-16
All parameters are highly significant.
Now a residual plot to check the adequacy of the model:
op <- par(mfrow=c(2,2))
plot(fm1)
par(op)
Residual plots of the straight-line model
One observation (32) seems to be an outlier. The residuals also suggest adding the square of the explanatory variable to the model.
We will try to “expand” the model until a good result is obtained.
A zoom on the first residual plot:
res_fitted_df <- data.frame(res=residuals(fm1), fit=fitted(fm1))
ggp <- ggplot(data= res_fitted_df, mapping = aes(x=fit, y=res)) +
geom_point(color="darkblue", size=2) +
stat_smooth(method="loess", colour="green", se=FALSE, span = 0.8) +
geom_hline(yintercept = 0, color="red", linetype=2) +
ylab("Residuals") + xlab("Fitted Values") +
ggtitle("Hardness ~ Density Residuals Plot")
print(ggp)
Residuals Vs. fitted values
Since a curvature (and, perhaps, heteroscedasticity, but we will analyze this later) appears in the residuals, a quadratic term is added to the model:
fm2 <- update(fm1, . ~ . + I(Density^2))
The Wald \(t\)-test p-values are:
summary(fm2)
##
## Call:
## lm(formula = Hardness ~ Density + I(Density^2), data = janka)
##
## Residuals:
## Min 1Q Median 3Q Max
## -326.63 -81.35 -15.27 59.87 537.82
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -118.0074 334.9669 -0.352 0.72686
## Density 9.4340 14.9356 0.632 0.53197
## I(Density^2) 0.5091 0.1567 3.248 0.00267 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 161.7 on 33 degrees of freedom
## Multiple R-squared: 0.9616, Adjusted R-squared: 0.9593
## F-statistic: 413.2 on 2 and 33 DF, p-value: < 2.2e-16
We can ask R to show only a part of the summary:
summary(fm2)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -118.0073759 334.966905 -0.3522956 0.726856611
## Density 9.4340214 14.935620 0.6316458 0.531969926
## I(Density^2) 0.5090775 0.156721 3.2483031 0.002669045
And the ANOVA for nested models is:
anova(fm2,fm1)
## Analysis of Variance Table
##
## Model 1: Hardness ~ Density + I(Density^2)
## Model 2: Hardness ~ Density
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 33 863325
## 2 34 1139366 -1 -276041 10.552 0.002669 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The contribution of the quadratic term is significant.
The first-degree coefficient is not significant: the axis of the parabola may be at 0.
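As a quick sketch (using the coefficient estimates reported above, not part of the original analysis), the axis of the fitted parabola can be located from the usual vertex formula:

```r
# Sketch: for a fitted parabola b0 + b1*x + b2*x^2, the axis (vertex)
# lies at x = -b1 / (2 * b2). Estimates taken from the summary above.
b1 <- 9.4340    # Density coefficient
b2 <- 0.5091    # I(Density^2) coefficient
-b1 / (2 * b2)  # about -9.27, well below the observed density range
```

Since b1 is not significantly different from 0, the vertex location is very uncertain and could indeed be at 0.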
Let us try to add a cubic term:
fm3 <- update(fm2, . ~ . + I(Density^3))
summary(fm3)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.414379e+02 1.235655e+03 -0.5191076 0.6072576
## Density 4.686373e+01 8.630180e+01 0.5430215 0.5908777
## I(Density^2) -3.311734e-01 1.913986e+00 -0.1730281 0.8637192
## I(Density^3) 5.958701e-03 1.352646e-02 0.4405220 0.6625207
anova(fm2,fm3)
## Analysis of Variance Table
##
## Model 1: Hardness ~ Density + I(Density^2)
## Model 2: Hardness ~ Density + I(Density^2) + I(Density^3)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 33 863325
## 2 32 858121 1 5204 0.1941 0.6625
Adding a third-degree term does not contribute significantly to the model.
In this case the intercept is not significant.
However, because of the marginality principle, when the p-th-degree term of an explanatory variable is significant, statisticians tend to keep all the lower-degree terms (intercept, first degree, second degree, …, (p-1)-th degree) of the same explanatory variable.
The same holds for interactions: if we keep a three-way interaction, all the two-way interactions and main effects of the variables involved in the three-way interaction should also be kept in the model.
If, for example, a linear effect within a more complex model is not significant, this does not necessarily mean that the linear effect does not exist. Sometimes (or, perhaps better, frequently), the non-significance of a term may simply be due to the design or to the chosen measurement unit.
Consequently, based on the marginality principle, it does not make sense to test the significance of main effects (or lower-order effects) when interaction effects (or higher-order effects) involving them are significant.
In this case, let us transform the independent variable by centering it at its median:
janka <- janka %>% mutate(norm_density = Density - median(Density))
This is the plot of the relation:
ggp <- ggplot(data = janka, mapping = aes(x = norm_density, y = Hardness)) +
geom_point(colour="darkblue") +
geom_smooth(method = "loess", colour="green", se = FALSE, span=0.8) +
geom_smooth(method = "lm", colour="red", se = FALSE)
print(ggp)
Hardness Vs. normalized (shifted on median) Density plot
The general “behavior” of the relation is essentially the same, but the “normalized” model estimates are:
fmd <- lm(Hardness~norm_density+I(norm_density^2), janka)
summary(fmd)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1165.8152232 36.852634 31.634516 2.908743e-26
## norm_density 51.9928972 2.633321 19.744232 7.579819e-20
## I(norm_density^2) 0.5090775 0.156721 3.248303 2.669045e-03
Here, the second-degree coefficient is identical to the previous one, whereas the intercept and the first-degree coefficient are now significant.
This modified result is obtained simply by shifting the explanatory variable. The significance of the lower-degree terms may thus depend on the measurement scale or on the design.
Finally, notice also that centering the explanatory variables is a tool to “reduce collinearity”, and hence to reduce the correlation between the coefficient estimates:
summary(fm2,correlation=TRUE)$correlation
## (Intercept) Density I(Density^2)
## (Intercept) 1.0000000 -0.9864144 0.9581079
## Density -0.9864144 1.0000000 -0.9908737
## I(Density^2) 0.9581079 -0.9908737 1.0000000
summary(fmd,correlation=TRUE)$correlation
## (Intercept) norm_density I(norm_density^2)
## (Intercept) 1.0000000 0.2528465 -0.6471058
## norm_density 0.2528465 1.0000000 -0.6445979
## I(norm_density^2) -0.6471058 -0.6445979 1.0000000
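An alternative way to remove (not just reduce) this collinearity is to use orthogonal polynomials via poly(). A minimal sketch on simulated densities (an illustration, not part of the analysis above):

```r
# Raw polynomial terms x and x^2 are almost perfectly correlated on a
# positive range; the columns produced by poly() are orthogonal instead.
set.seed(1)
x <- runif(36, 24, 70)             # densities on a Janka-like range
cor(x, x^2)                        # close to 1
P <- poly(x, 2)                    # orthogonal 1st- and 2nd-degree columns
round(cor(P[, 1], P[, 2]), 10)     # essentially 0 by construction
```

A model such as lm(Hardness ~ poly(Density, 2)) fits the same curve as fm2, but with uncorrelated coefficient estimates.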
Now we make further checks with the studentized residuals (see the Some Theory on Linear Models chapter):
res_fitted_df <- data.frame(res=studres(fmd), fit=fitted(fmd))
ggp1 <- ggplot(data= res_fitted_df, mapping = aes(x=fit, y=res)) +
geom_point(color="darkblue", size=2) +
stat_smooth(method="loess", colour="green", se=FALSE, span = 0.8) +
geom_hline(yintercept = 0, color="red", linetype=4) +
ylab("Residuals") + xlab("Fitted Values") +
ggtitle("Residuals vs Fit Plot")
ggp2 <- ggplot(data = res_fitted_df, mapping = aes(sample = res)) +
stat_qq(color="darkblue", size=2) +
geom_abline(mapping=aes(intercept=mean(res),slope=sd(res)), color="red", linetype=2) +
xlab("Normal scores") + ylab("Sorted studentized residuals") +
ggtitle("Residuals Normal Probability Plot")
grid.arrange(ggp1, ggp2, ncol=2)
The residual variability seems non-constant (heteroscedasticity is present) and seems to increase as the response variable increases. We should search for a transformation of the response variable that produces models with “more normal” residuals and reduced heteroscedasticity.
We will search for the transformation within the Box-Cox family (see the Some Theory on Linear Models chapter):
bxcx <- boxcox(fmd,lambda = seq(-0.25, 1, len=20))
Box-Cox log-likelihood Vs. Lambda
boxcox() returns a plot with the MLE of \(\lambda\) and a confidence interval for it, identified by the intersections of the horizontal dotted line with the log-likelihood. When 0 is included in the confidence interval, \(\lambda = 0\) (the logarithm) is often an advisable choice.
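The MLE of \(\lambda\) can also be extracted numerically from the object returned by boxcox(). A sketch on simulated data (the variable names are illustrative, not those of the janka fit):

```r
# boxcox() returns a list with the lambda grid (x) and the profile
# log-likelihood (y); the MLE is the grid value where y peaks.
library(MASS)
set.seed(2)
x <- runif(50, 1, 10)
y <- exp(0.5 + 0.3 * x + rnorm(50, sd = 0.2))    # log-normal response
fit <- lm(y ~ x)
bc <- boxcox(fit, lambda = seq(-1, 1, len = 100), plotit = FALSE)
lambda_hat <- bc$x[which.max(bc$y)]              # lambda maximizing log-lik
lambda_hat                                       # should fall near 0 here
```
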
We then re-analyze the model using the log-transform of the dependent variable:
lfmd <- lm(log(Hardness) ~ norm_density + I(norm_density^2), janka)
or, equivalently,
lfmd <- update(fmd, log(.) ~ .)
The significance of the coefficients:
round(summary(lfmd)$coef, 4)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.0499 0.0230 307.0530 0
## norm_density 0.0478 0.0016 29.1418 0
## I(norm_density^2) -0.0005 0.0001 -5.3542 0
The coefficients are all significant.
And finally the (studentized) residual check:
res_fitted_df <- data.frame(res=studres(lfmd), fit=fitted(lfmd))
ggp1 <- ggplot(data= res_fitted_df, mapping = aes(x=fit, y=res)) +
geom_point(color="darkblue", size=2) +
stat_smooth(method="loess", colour="green", se=FALSE, span = 0.8) +
geom_hline(yintercept = 0, color="red", linetype=4) +
ylab("Residuals") + xlab("Fitted Values") +
ggtitle("Residuals vs Fit Plot")
ggp2 <- ggplot(data = res_fitted_df, mapping = aes(sample = res)) +
stat_qq(color="darkblue", size=2) +
geom_abline(mapping=aes(intercept=mean(res),slope=sd(res)), color="red", linetype=2) +
xlab("Normal scores") + ylab("Sorted studentized residuals") +
ggtitle("Residuals Normal Probability Plot")
grid.arrange(ggp1, ggp2, ncol=2)
Studentized residual plot
The plots are improved, and this last model seems the best one.
Now, since the response is the logarithm of y, the contributions of the explanatory variables are multiplicative with respect to y.
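For instance, using the norm_density coefficient estimated above (0.0478), each additional unit of density multiplies the predicted hardness by a constant factor:

```r
# Sketch using the coefficient reported above for the log-scale model:
# with log(Hardness) as response, a one-unit increase in (normalized)
# density multiplies the predicted hardness by exp(beta).
b1 <- 0.0478
exp(b1)               # multiplicative factor per unit of density
100 * (exp(b1) - 1)   # about a 4.9% increase in hardness per unit
```
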
And the standard residual check:
op <- par(mfrow = c(2,2))
plot(lfmd)
par(op)
‘Standard’ residual plots of the final model
Now the plots of the model predictions (in log scale and in the original scale) with 95% prediction intervals.
pred <- data.frame(predict(lfmd, interval = "prediction"))
plot_df <- data.frame(d = janka$norm_density, h = log(janka$Hardness), pred)
ggp1 <- ggplot(data = plot_df, mapping = aes(x=d, y=h)) +
geom_point(col = "darkblue", size = 2) +
geom_line(mapping=aes(x=d, y=fit), col = "mediumvioletred") +
geom_line(mapping=aes(x=d, y=lwr), col = "mediumorchid1", linetype=4) +
geom_line(mapping=aes(x=d, y=upr), col = "mediumorchid1", linetype=4) +
xlab("Normalized Density") + ylab("log (Hardness)")
ggp2 <- ggplot(data = plot_df, mapping = aes(x=d, y=exp(h))) +
geom_point(col = "darkblue", size = 2) +
geom_line(mapping=aes(x=d, y=exp(fit)), col = "mediumvioletred") +
geom_line(mapping=aes(x=d, y=exp(lwr)), col = "mediumorchid1", linetype=4) +
geom_line(mapping=aes(x=d, y=exp(upr)), col = "mediumorchid1", linetype=4) +
xlab("Normalized Density") + ylab("Hardness")
grid.arrange(ggp1, ggp2, ncol=2)
Final model predictions with 95% intervals in log-scale and original scale
The following data give the levels of an air pollutant, “Oxidant”, together with the levels of four meteorological variables recorded on 30 days during one summer. Which, if any, of the four independent variables seem to be related to the levels of oxidant?
data(oxidant)
str(oxidant)
## Classes 'tbl_df', 'tbl' and 'data.frame': 30 obs. of 6 variables:
## $ day : int 1 2 3 4 5 6 7 8 9 10 ...
## $ windspeed : int 50 47 57 38 52 57 53 62 52 42 ...
## $ temperature: int 77 80 75 72 71 74 78 82 82 82 ...
## $ humidity : int 67 66 77 73 75 75 64 59 60 62 ...
## $ insolation : int 78 77 73 69 78 80 75 78 75 58 ...
## $ oxidant : int 15 20 13 21 12 12 12 11 12 20 ...
head(oxidant)
## Source: local data frame [6 x 6]
##
## day windspeed temperature humidity insolation oxidant
## (int) (int) (int) (int) (int) (int)
## 1 1 50 77 67 78 15
## 2 2 47 80 66 77 20
## 3 3 57 75 77 73 13
## 4 4 38 72 73 69 21
## 5 5 52 71 75 78 12
## 6 6 57 74 75 80 12
Let us start by producing a matrix of scatterplots:
ggpairs(data = oxidant)
Matrix of scatterplots between the meteorological variables and Oxidant
The above line of code draws a matrix of scatterplots between all possible pairs of variables.
Some relations between oxidant and the independent variables appear.
We will use models to assess which independent variables actually influence the dependent variable.
We begin by estimating a linear model with all the independent variables, excluding day.
fm1 <- lm(oxidant ~ windspeed+temperature+humidity+insolation, data = oxidant)
summary(fm1)
##
## Call:
## lm(formula = oxidant ~ windspeed + temperature + humidity + insolation,
## data = oxidant)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.5861 -1.0961 0.3512 1.7570 4.0712
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -15.49370 13.50647 -1.147 0.26219
## windspeed -0.44291 0.08678 -5.104 2.85e-05 ***
## temperature 0.56933 0.13977 4.073 0.00041 ***
## humidity 0.09292 0.06535 1.422 0.16743
## insolation 0.02275 0.05067 0.449 0.65728
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.92 on 25 degrees of freedom
## Multiple R-squared: 0.798, Adjusted R-squared: 0.7657
## F-statistic: 24.69 on 4 and 25 DF, p-value: 2.279e-08
The model seems to fit the data well enough (R-squared equal to 0.798). Now let us try to “expand” the initial model by adding all two-way interactions.
fm2 <- lm(oxidant ~ (windspeed+temperature+humidity+insolation)^2, data = oxidant)
summary(fm2)
##
## Call:
## lm(formula = oxidant ~ (windspeed + temperature + humidity +
## insolation)^2, data = oxidant)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.1559 -1.3606 -0.0746 1.2529 4.0781
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.358e+02 1.620e+02 -2.073 0.0521 .
## windspeed -2.474e-02 1.914e+00 -0.013 0.9898
## temperature 2.485e+00 1.338e+00 1.858 0.0788 .
## humidity 3.611e+00 1.645e+00 2.195 0.0408 *
## insolation 3.510e+00 1.675e+00 2.096 0.0497 *
## windspeed:temperature -7.655e-03 2.429e-02 -0.315 0.7561
## windspeed:humidity 4.703e-03 1.083e-02 0.434 0.6691
## windspeed:insolation -1.361e-03 8.757e-03 -0.155 0.8782
## temperature:humidity -1.086e-02 1.072e-02 -1.013 0.3239
## temperature:insolation -1.414e-02 1.292e-02 -1.095 0.2874
## humidity:insolation -3.758e-02 2.226e-02 -1.688 0.1078
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.607 on 19 degrees of freedom
## Multiple R-squared: 0.8776, Adjusted R-squared: 0.8132
## F-statistic: 13.62 on 10 and 19 DF, p-value: 1.146e-06
The Wald statistics of the two-way interactions are all non-significant. Now we will check whether all the interactions together are non-significant, by comparing the two models above:
anova(fm2, fm1, test = "F")
## Analysis of Variance Table
##
## Model 1: oxidant ~ (windspeed + temperature + humidity + insolation)^2
## Model 2: oxidant ~ windspeed + temperature + humidity + insolation
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 19 129.10
## 2 25 213.09 -6 -83.984 2.06 0.107
The difference between the two models is non-significant (adding all the interaction effects does not seem to produce a better fit). fm1 seems the better model.
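Candidate terms can also be screened one at a time with add1(), which tests each addition against the current model. A sketch on simulated data (the variables are illustrative, not the oxidant ones):

```r
# add1() computes an F-test for each term in scope that is not already
# in the model, each added one at a time to the main-effects fit.
set.seed(3)
d <- data.frame(x1 = rnorm(30), x2 = rnorm(30), x3 = rnorm(30))
d$y <- 1 + d$x1 + d$x2 + 0.8 * d$x2 * d$x3 + rnorm(30, sd = 0.5)
fm_main <- lm(y ~ x1 + x2 + x3, data = d)
add1(fm_main, scope = ~ (x1 + x2 + x3)^2, test = "F")
```

Here the x2:x3 row should stand out, mirroring the search for a single useful interaction below.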
summary(fm1)
##
## Call:
## lm(formula = oxidant ~ windspeed + temperature + humidity + insolation,
## data = oxidant)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.5861 -1.0961 0.3512 1.7570 4.0712
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -15.49370 13.50647 -1.147 0.26219
## windspeed -0.44291 0.08678 -5.104 2.85e-05 ***
## temperature 0.56933 0.13977 4.073 0.00041 ***
## humidity 0.09292 0.06535 1.422 0.16743
## insolation 0.02275 0.05067 0.449 0.65728
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.92 on 25 degrees of freedom
## Multiple R-squared: 0.798, Adjusted R-squared: 0.7657
## F-statistic: 24.69 on 4 and 25 DF, p-value: 2.279e-08
Let us verify whether the humidity:insolation interaction (the one with the smallest p-value) is still non-significant when added alone.
fm3 <- update(fm1,.~.+humidity:insolation)
summary(fm3)
##
## Call:
## lm(formula = oxidant ~ windspeed + temperature + humidity + insolation +
## humidity:insolation, data = oxidant)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.0097 -2.0182 0.4324 1.3284 3.5777
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -202.42439 66.60656 -3.039 0.005654 **
## windspeed -0.43519 0.07659 -5.682 7.48e-06 ***
## temperature 0.48459 0.12681 3.822 0.000826 ***
## humidity 3.10346 1.05698 2.936 0.007219 **
## insolation 2.53887 0.88321 2.875 0.008343 **
## humidity:insolation -0.03923 0.01375 -2.852 0.008786 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.575 on 24 degrees of freedom
## Multiple R-squared: 0.8491, Adjusted R-squared: 0.8177
## F-statistic: 27.02 on 5 and 24 DF, p-value: 4.027e-09
The humidity:insolation interaction is significant! fm3 could be the best model.
anova(fm1,fm3)
## Analysis of Variance Table
##
## Model 1: oxidant ~ windspeed + temperature + humidity + insolation
## Model 2: oxidant ~ windspeed + temperature + humidity + insolation + humidity:insolation
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 25 213.09
## 2 24 159.14 1 53.952 8.1367 0.008786 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Now the usual residuals checks:
op <- par(mfrow = c(2, 2))
plot(fm3)
par(op)
Residual plots of the first model describing Oxidant
For models with many explanatory variables, a useful residual check is to plot the residuals against the explanatory variables as well.
Now, plot all pairs including the residuals:
oxidant2 <- oxidant %>% bind_cols(data.frame(res=residuals(fm3)))
ggpairs(oxidant2)
Matrix of scatterplots between all variables and the residuals
Looking at the scatterplots of the residuals versus the explanatory variables (last row of the matrix), it seems that humidity^2 may be useful to explain oxidant:
fm4 <- update(fm3,.~.+I(humidity^2))
summary(fm4)
##
## Call:
## lm(formula = oxidant ~ windspeed + temperature + humidity + insolation +
## I(humidity^2) + humidity:insolation, data = oxidant)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.1449 -1.6823 0.0983 1.2565 3.5587
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.001e+02 8.260e+01 -1.212 0.237653
## windspeed -4.412e-01 7.269e-02 -6.070 3.43e-06 ***
## temperature 4.836e-01 1.202e-01 4.022 0.000533 ***
## humidity 9.939e-01 1.487e+00 0.669 0.510395
## insolation 1.560e+00 9.804e-01 1.591 0.125317
## I(humidity^2) 8.249e-03 4.293e-03 1.922 0.067136 .
## humidity:insolation -2.420e-02 1.521e-02 -1.591 0.125171
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.442 on 23 degrees of freedom
## Multiple R-squared: 0.87, Adjusted R-squared: 0.8361
## F-statistic: 25.65 on 6 and 23 DF, p-value: 4.226e-09
Maybe many other coefficients (humidity, insolation and the interaction between them) were significant only because the square of humidity was not included in the model.
Let us first try to drop the interaction:
fm4 <- update(fm4,.~.-humidity:insolation)
summary(fm4)
##
## Call:
## lm(formula = oxidant ~ windspeed + temperature + humidity + insolation +
## I(humidity^2), data = oxidant)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.0380 -1.5176 -0.2186 1.7743 3.6686
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 28.2321174 18.3024288 1.543 0.136028
## windspeed -0.4473010 0.0748704 -5.974 3.63e-06 ***
## temperature 0.5216385 0.1215488 4.292 0.000252 ***
## humidity -1.2705075 0.4437174 -2.863 0.008566 **
## insolation 0.0008605 0.0442765 0.019 0.984655
## I(humidity^2) 0.0117636 0.0037974 3.098 0.004913 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.518 on 24 degrees of freedom
## Multiple R-squared: 0.8557, Adjusted R-squared: 0.8256
## F-statistic: 28.46 on 5 and 24 DF, p-value: 2.388e-09
Now the model is simpler, and insolation seems non-significant.
Let us try to remove the non-significant effects.
Always remove terms one by one:
fm <- update(fm4, .~.-insolation)
summary(fm)
##
## Call:
## lm(formula = oxidant ~ windspeed + temperature + humidity + I(humidity^2),
## data = oxidant)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.037 -1.514 -0.212 1.772 3.670
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 28.234869 17.932249 1.575 0.12794
## windspeed -0.447426 0.073085 -6.122 2.12e-06 ***
## temperature 0.522791 0.103961 5.029 3.46e-05 ***
## humidity -1.271667 0.430808 -2.952 0.00678 **
## I(humidity^2) 0.011775 0.003673 3.206 0.00366 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.468 on 25 degrees of freedom
## Multiple R-squared: 0.8557, Adjusted R-squared: 0.8326
## F-statistic: 37.06 on 4 and 25 DF, p-value: 3.626e-10
Now all coefficients (except for the intercept) are significant.
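The same one-term-at-a-time pruning can be automated with drop1(), which also respects the marginality principle: main effects involved in a retained interaction are not offered for removal. A sketch on simulated data (illustrative variables):

```r
# drop1() tests the removal of each removable term, one at a time.
set.seed(4)
d <- data.frame(x1 = rnorm(40), x2 = rnorm(40), x3 = rnorm(40))
d$y <- 1 + 2 * d$x1 + rnorm(40, sd = 0.5)        # only x1 really matters
fm_full <- lm(y ~ x1 + x2 + x3 + x2:x3, data = d)
drop1(fm_full, test = "F")  # x2, x3 absent from table: x2:x3 is still in
```
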
We make a further check against overparameterization, adding an interaction:
anova(fm, update(fm , .~.+windspeed:temperature), test = "F")
## Analysis of Variance Table
##
## Model 1: oxidant ~ windspeed + temperature + humidity + I(humidity^2)
## Model 2: oxidant ~ windspeed + temperature + humidity + I(humidity^2) +
## windspeed:temperature
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 25 152.22
## 2 24 150.54 1 1.6795 0.2677 0.6096
Adding the new term does not add any useful information to the model.
fm seems the best model. Let us check it with the usual residual plots:
op <- par(mfrow = c(2, 2))
plot(fm)
par(op)
Residual plots for the second model
The residual plots seem better than before.
And now plot all pairs to check whether some other relation emerges.
oxidant3 <- oxidant %>% bind_cols(data.frame(res=residuals(fm)))
ggpairs(oxidant3)
Matrix of scatterplots between the variables and the residuals
The model seems good enough.
The data contain calories and sodium content for three types of hot dogs: 20 beef, 17 meat and 17 poultry hot dogs.
These data are a sample from much larger populations. We want to determine whether the type of hot dog and the sodium content influence the calorie content of the hot dogs.
data(hotdogs)
str(hotdogs)
## Classes 'tbl_df', 'tbl' and 'data.frame': 54 obs. of 3 variables:
## $ type : Factor w/ 3 levels "beef","meat",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ calories: int 186 181 176 149 184 190 158 139 175 148 ...
## $ sodium : int 495 477 425 322 482 587 370 322 479 375 ...
head(hotdogs)
## Source: local data frame [6 x 3]
##
## type calories sodium
## (fctr) (int) (int)
## 1 beef 186 495
## 2 beef 181 477
## 3 beef 176 425
## 4 beef 149 322
## 5 beef 184 482
## 6 beef 190 587
Now let us produce a boxplot for each type of hot dog:
ggp <- ggplot(data = hotdogs, mapping = aes(x = type, y=calories, fill=type)) +
geom_boxplot()
print(ggp)
Box and whisker plot of calories by type of hot dog
Some difference appears between the types of hot dogs. Beef and meat hot dogs seem similar.
The following graphs represent the relation between calories and sodium, without and with categorization by type. The split by category is then displayed in several ways.
ggp <- ggplot(data=hotdogs, mapping = aes(x=sodium, y=calories)) +
geom_point(size=2, color="darkblue")
print(ggp)
Simple scatterplot of calories Vs. sodium
ggp <- ggplot(data=hotdogs, mapping = aes(x=sodium, y=calories, color=type)) +
geom_point(size=2)
print(ggp)
Scatterplot of calories Vs. sodium with types with different colors
ggp <- ggplot(data=hotdogs, mapping = aes(x=sodium, y=calories, shape=type)) +
geom_point(size=2)
print(ggp)
Scatterplot of calories Vs. sodium with types with different symbols
ggp <- ggplot(data=hotdogs, mapping = aes(x=sodium, y=calories, shape= type, color=type)) +
geom_point(size=2)
print(ggp)
Scatterplot of calories Vs. sodium with types with different colors and symbols
An increasing relationship between calories and sodium seems to exist, with differences across types.
Plot calories vs. sodium by type:
ggp <- ggplot(data=hotdogs, mapping = aes(x=sodium,y=calories)) +
geom_point(color="darkblue", size=2)+
geom_smooth(method = "lm", se=FALSE, colour="red") +
geom_smooth(method = "loess", se=FALSE, colour="green") +
facet_wrap(facets = ~type)
print(ggp)
Scatterplot of calories Vs. sodium by type
Let us start with the estimation of the full model, with main effects and interactions:
fm <- lm(calories ~type*sodium, data = hotdogs)
summary(fm)
##
## Call:
## lm(formula = calories ~ type * sodium, data = hotdogs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -31.384 -8.879 1.251 10.407 28.031
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 78.19241 12.84656 6.087 1.85e-07 ***
## typemeat -16.63121 20.38531 -0.816 0.4186
## typepoultry -40.27566 23.00901 -1.750 0.0864 .
## sodium 0.19608 0.03108 6.310 8.42e-08 ***
## typemeat:sodium 0.03603 0.04828 0.746 0.4592
## typepoultry:sodium -0.01994 0.05140 -0.388 0.6997
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.88 on 48 degrees of freedom
## Multiple R-squared: 0.798, Adjusted R-squared: 0.777
## F-statistic: 37.93 on 5 and 48 DF, p-value: 1.465e-15
summary.aov(fm)
## Df Sum Sq Mean Sq F value Pr(>F)
## type 2 17692 8846 45.95 7.09e-12 ***
## sodium 1 18614 18614 96.68 4.39e-13 ***
## type:sodium 2 212 106 0.55 0.581
## Residuals 48 9242 193
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The model includes the intercept, a coefficient for the meat intercept dummy variable, a coefficient for the poultry intercept dummy variable, the sodium coefficient, and the interaction coefficients between type and sodium, which tell us how much the sodium slope varies when the type is meat or poultry instead of beef.
The sodium and type coefficients are significant; the interaction between the sodium slope and type is not.
The p-values of the aov table suggest a model with three parallel regression lines.
Notice that, since the design is unbalanced, we obtain different results when switching the order of the independent variables in the model formula.
fm <- lm(calories ~ sodium * type, data = hotdogs)
summary(fm)
##
## Call:
## lm(formula = calories ~ sodium * type, data = hotdogs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -31.384 -8.879 1.251 10.407 28.031
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 78.19241 12.84656 6.087 1.85e-07 ***
## sodium 0.19608 0.03108 6.310 8.42e-08 ***
## typemeat -16.63121 20.38531 -0.816 0.4186
## typepoultry -40.27566 23.00901 -1.750 0.0864 .
## sodium:typemeat 0.03603 0.04828 0.746 0.4592
## sodium:typepoultry -0.01994 0.05140 -0.388 0.6997
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.88 on 48 degrees of freedom
## Multiple R-squared: 0.798, Adjusted R-squared: 0.777
## F-statistic: 37.93 on 5 and 48 DF, p-value: 1.465e-15
summary.aov(fm)
## Df Sum Sq Mean Sq F value Pr(>F)
## sodium 1 9986 9986 51.86 3.61e-09 ***
## type 2 26321 13160 68.35 9.00e-15 ***
## sodium:type 2 212 106 0.55 0.581
## Residuals 48 9242 193
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value for the interaction remains the same in this case, and is non-significant. The p-values of the main effects, however, are always less than \(\alpha=\) 0.05.
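The order dependence comes from the sequential (Type I) sums of squares that anova() and summary.aov() compute. A sketch on simulated unbalanced data (illustrative variables); with the car package loaded above, Anova(fm, type = 2) would give order-invariant Type II tests instead:

```r
# With unbalanced groups and correlated predictors, the sequential SS
# attributed to a term depends on its position in the formula.
set.seed(5)
g <- factor(rep(c("a", "b", "c"), c(20, 17, 17)))  # unbalanced groups
x <- rnorm(54) + (g == "c")                        # x correlated with g
y <- 1 + x + 2 * (g == "c") + rnorm(54)
ss_first <- anova(lm(y ~ x + g))["x", "Sum Sq"]
ss_last  <- anova(lm(y ~ g + x))["x", "Sum Sq"]
c(ss_first, ss_last)   # different values for the same term
```
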
Now we will remove the interaction term:
fm <- update(fm, .~.-sodium:type)
summary.aov(fm)
## Df Sum Sq Mean Sq F value Pr(>F)
## sodium 1 9986 9986 52.81 2.29e-09 ***
## type 2 26321 13160 69.61 3.55e-15 ***
## Residuals 50 9453 189
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(fm)
##
## Call:
## lm(formula = calories ~ sodium + type, data = hotdogs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -33.548 -8.552 0.495 8.800 30.221
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 75.73501 8.73406 8.671 1.55e-11 ***
## sodium 0.20221 0.02038 9.922 2.09e-13 ***
## typemeat -1.65834 4.54974 -0.364 0.717
## typepoultry -49.78292 4.68663 -10.622 2.03e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.75 on 50 degrees of freedom
## Multiple R-squared: 0.7934, Adjusted R-squared: 0.781
## F-statistic: 64.01 on 3 and 50 DF, p-value: < 2.2e-16
The difference in calorie content between meat and beef hot dogs is non-significant.
Now we will try to change the parameterization to make the model more readable:
fm <- lm(calories ~ sodium + type - 1, data = hotdogs)
This parameterization of fm can generally be used to better understand the meaning of the model and of the parameter values.
In this formulation, summary.aov(fm) usually does not make sense, because it compares the analyzed model with a model with zero grand mean.
summary(fm)
##
## Call:
## lm(formula = calories ~ sodium + type - 1, data = hotdogs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -33.548 -8.552 0.495 8.800 30.221
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## sodium 0.20221 0.02038 9.922 2.09e-13 ***
## typebeef 75.73501 8.73406 8.671 1.55e-11 ***
## typemeat 74.07667 9.15796 8.089 1.21e-10 ***
## typepoultry 25.95209 9.93062 2.613 0.0118 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.75 on 50 degrees of freedom
## Multiple R-squared: 0.992, Adjusted R-squared: 0.9914
## F-statistic: 1558 on 4 and 50 DF, p-value: < 2.2e-16
The parameters of the model specified in this manner are more easily interpretable: there are three parallel straight lines with slope equal to 0.2022 and intercepts equal to 75.7350, 74.0767 and 25.9521 for beef, meat and poultry, respectively.
Since beef and meat hot dogs seem similar, we try to compare their intercepts using Helmert-type contrasts:
contrasts(hotdogs$type) <- matrix(c(-1,-1,1,-1,0,2),ncol=2,nrow=3,byrow=TRUE)
fm3 <- lm(calories ~ sodium + type, data = hotdogs)
summary(fm3)
##
## Call:
## lm(formula = calories ~ sodium + type, data = hotdogs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -33.548 -8.552 0.495 8.800 30.221
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 58.58792 8.88644 6.593 2.60e-08 ***
## sodium 0.20221 0.02038 9.922 2.09e-13 ***
## type1 -0.82917 2.27487 -0.364 0.717
## type2 -16.31792 1.38519 -11.780 4.90e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.75 on 50 degrees of freedom
## Multiple R-squared: 0.7934, Adjusted R-squared: 0.781
## F-statistic: 64.01 on 3 and 50 DF, p-value: < 2.2e-16
The type1 coefficient represents the comparison between the beef and meat intercepts.
Since its p-value is greater than \(\alpha\)=0.05, the beef and meat intercepts can be considered equal.
The type2 coefficient, on the contrary, is the difference between the poultry intercept and the pooled beef/meat intercept. It is highly significant, so poultry hot dogs may be considered different from the other (pooled) types.
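To make the contrast coding explicit, here is a sketch of the matrix assigned above, with rows labelled by the factor levels (beef, meat, poultry, in alphabetical order):

```r
# Column 1 (type1) contrasts meat against beef; column 2 (type2)
# contrasts poultry against the pooled beef/meat group.
cmat <- matrix(c(-1, -1,
                  1, -1,
                  0,  2),
               nrow = 3, ncol = 2, byrow = TRUE,
               dimnames = list(c("beef", "meat", "poultry"),
                               c("type1", "type2")))
cmat
```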
Let us try to recode factor type to obtain only two types of hot-dogs.
hotdogs$type2 <- as.character(hotdogs$type)
hotdogs$type2[hotdogs$type2 == "beef"] = "meat"
hotdogs$type2 <- factor(hotdogs$type2, levels = c("meat", "poultry"))
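A quick cross-tabulation (a sketch) can verify that the recoding merged beef into meat and left poultry unchanged:

```r
# Each original level should map entirely to one new level.
table(old = hotdogs$type, new = hotdogs$type2)
```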
And now the new model is fitted to data:
fm1 <- lm(calories ~ sodium + type2, data = hotdogs)
summary(fm1)
##
## Call:
## lm(formula = calories ~ sodium + type2, data = hotdogs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -33.500 -7.904 0.253 9.578 30.172
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 75.20964 8.54076 8.806 8.18e-12 ***
## sodium 0.20163 0.02014 10.010 1.25e-13 ***
## type2poultry -48.99215 4.11877 -11.895 2.52e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.63 on 51 degrees of freedom
## Multiple R-squared: 0.7929, Adjusted R-squared: 0.7847
## F-statistic: 97.61 on 2 and 51 DF, p-value: < 2.2e-16
The next two lines of code compare the residual standard deviation of the model with three types of hot dogs with that of the model with two types:
summary(fm)$sigma
## [1] 13.75008
summary(fm1)$sigma
## [1] 13.63268
They are very similar (the residual standard deviation of the “reduced” model is actually smaller), which confirms that meat and beef hot dogs behave alike and may be pooled as far as the relation between calories and sodium is concerned.
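The informal comparison of residual standard deviations can be backed by a formal F-test: fm1 is nested in fm (pooling beef and meat restricts the three intercepts to two), so anova applies. A sketch:

```r
# F-test of the pooling restriction; a large p-value supports
# treating beef and meat as a single type.
anova(fm1, fm)
```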
The usual diagnostic plots for the model with three and two types of hot dogs, respectively, are reported below.
op <- par(mfrow = c(2, 2))
plot(fm)
par(op)
Residual plot of model with distinct types for meat and beef
op <- par(mfrow = c(2, 2))
plot(fm1)
par(op)
Residual plot of model with pooled types for meat and beef
Both graphs show that the residuals are well behaved.
The data contain the weight, height, gender, and geographical area (“Nord”, “Centro”, “Sud”, and “Isole”) of 1806 Italian people.
The main goal of this analysis is to find a simple model for the relation between the height and weight of Italian people.
data(istat)
str(istat)
## Classes 'tbl_df', 'tbl' and 'data.frame': 1806 obs. of 4 variables:
## $ Gender: Factor w/ 2 levels "Female","Male": 1 1 2 2 2 2 2 2 2 2 ...
## $ Area : Factor w/ 4 levels "Centro","Isole",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ Weight: int 64 57 68 64 70 60 45 76 70 77 ...
## $ Height: int 172 167 175 173 172 170 160 177 170 174 ...
head(istat)
## Source: local data frame [6 x 4]
##
## Gender Area Weight Height
## (fctr) (fctr) (int) (int)
## 1 Female Nord 64 172
## 2 Female Nord 57 167
## 3 Male Nord 68 175
## 4 Male Nord 64 173
## 5 Male Nord 70 172
## 6 Male Nord 60 170
Initially, we produce a plot that shows the relationship between Weight and Height:
ggp <- ggplot(data=istat, mapping = aes(x=Height, y=Weight))+
geom_point(color="darkblue")
print(ggp)
Scatterplot of Weight Vs. Height
And then the above plot by gender with regression lines:
ggp <- ggplot(data=istat,mapping = aes(x=Height, y= Weight)) +
geom_point(colour="darkblue") +
facet_wrap(facets = ~Gender) +
geom_smooth(method = "lm", colour= "red", se=FALSE)
print(ggp)
Scatterplot of Weight Vs. Height by Gender
Now, a plot by area and gender with regression lines:
ggp <- ggplot(data=istat, mapping = aes(x=Height, y= Weight)) +
geom_point(colour="darkblue") +
facet_grid(facets = Gender~Area) +
geom_smooth(method = "lm", colour= "red", se=FALSE)
print(ggp)
Scatterplot of Weight Vs. Height by Area and Gender
The above graphs suggest trying a model with all two-way interactions.
fm <- lm(Weight ~ (Height + Gender + Area)^2, data = istat)
summary.aov(fm)
## Df Sum Sq Mean Sq F value Pr(>F)
## Height 1 133942 133942 1551.623 < 2e-16 ***
## Gender 1 19117 19117 221.462 < 2e-16 ***
## Area 3 1979 660 7.640 4.49e-05 ***
## Height:Gender 1 2758 2758 31.955 1.83e-08 ***
## Height:Area 3 390 130 1.506 0.211
## Gender:Area 3 56 19 0.217 0.884
## Residuals 1793 154778 86
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Now we simplify the model by removing the non-significant interactions, and then check whether their joint contribution is significant:
fm <- lm(Weight ~ Height + Gender + Area + Height:Gender, data = istat)
summary.aov(fm)
## Df Sum Sq Mean Sq F value Pr(>F)
## Height 1 133942 133942 1552.339 < 2e-16 ***
## Gender 1 19117 19117 221.564 < 2e-16 ***
## Area 3 1979 660 7.643 4.46e-05 ***
## Height:Gender 1 2758 2758 31.970 1.82e-08 ***
## Residuals 1799 155225 86
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(fm,update(fm, .~. +Height:Area+Gender:Area))
## Analysis of Variance Table
##
## Model 1: Weight ~ Height + Gender + Area + Height:Gender
## Model 2: Weight ~ Height + Gender + Area + Height:Gender + Height:Area +
## Gender:Area
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 1799 155225
## 2 1793 154778 6 446.3 0.8617 0.5224
The simplified model indicates that the slope of the relation between Weight and Height is modified by Gender only, while differences in intercepts are due to both Area and Gender.
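The Height:Gender interaction can be read off as gender-specific slopes. A sketch, assuming the default treatment contrasts with Female as the reference level (so the interaction coefficient is named Height:GenderMale):

```r
b <- coef(fm)
# The Female slope is the Height coefficient; the Male slope adds
# the interaction term.
c(Female = unname(b["Height"]),
  Male   = unname(b["Height"] + b["Height:GenderMale"]))
```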
Now we plot the diagnostic graphs.
op <- par(mfrow = c(2, 2))
plot(fm)
par(op)
Residual plot of first model
The residuals are not symmetric.
We re-plot the data by gender, adding a loess smoother as a reference for modelling.
ggp <- ggplot(data=istat,mapping = aes(x=Height, y= Weight)) +
geom_point(colour="darkblue") +
facet_wrap(facets = ~Gender) +
geom_smooth(method = "lm", colour= "red", se=FALSE) +
geom_smooth(method = "loess", colour= "green", se=FALSE, span=0.5)
print(ggp)
Scatterplot of Weight Vs. Height by Gender with lowess and fitted line
Now we apply a Box-Cox transformation to correct the skewness of the residuals and the apparent non-linearity of the relationship.
This is the model:
fm <- lm(Weight ~ Area + Height * Gender, data = istat)
And then the Box-Cox transformation analysis:
boxcox(fm, lambda = seq(-0.5, 0.5, 1/20))
grid()
Box-Cox likelihood graph for Istat toy model
A transformation with \(\lambda=-\frac{1}{5}\) seems a good choice.
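Rather than reading \(\lambda\) off the plot, the maximizer of the profile log-likelihood can be extracted directly; a sketch:

```r
# boxcox() with plotit = FALSE returns the grid of lambda values (x)
# and the corresponding profile log-likelihoods (y).
bc <- boxcox(fm, lambda = seq(-0.5, 0.5, 1/20), plotit = FALSE)
bc$x[which.max(bc$y)]
```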
This is the model on transformed scale:
fm_l <- update(fm, Weight^(-0.2)~.)
summary.aov(fm_l)
## Df Sum Sq Mean Sq F value Pr(>F)
## Area 3 0.00052 0.00017 1.303 0.2718
## Height 1 0.22376 0.22376 1690.675 <2e-16 ***
## Gender 1 0.03225 0.03225 243.665 <2e-16 ***
## Height:Gender 1 0.00115 0.00115 8.712 0.0032 **
## Residuals 1799 0.23810 0.00013
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
To complete the analysis, let us remove Area, which appears non-significant.
fm_l <- update(fm_l, .~.-Area)
(Note: is this choice entirely correct? Is Area really non-significant? The sequential table above tests Area before adjusting for the other terms.)
op <- par(mfrow = c(2, 2))
plot(fm_l)
par(op)
Residual plot for Istat toy model
The residuals are not really normal, but for a “toy” model this result can be considered acceptable.
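A formal normality test can complement the visual check, keeping in mind that with n = 1806 even mild departures from normality will be flagged as significant; a sketch:

```r
# Shapiro-Wilk test on the residuals of the transformed-scale model.
shapiro.test(residuals(fm_l))
```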
Now we calculate the predicted values on transformed scale:
newdata <- data.frame(Height = c(150:199, 150:199), Gender = factor(rep(c("Female", "Male"), each = 50)))
pred <- predict(fm_l, newdata = newdata, se = T, interval = "prediction")
And the predicted values with prediction interval on transformed scale:
newdata$fit <- pred$fit[,1] # predicted values
newdata$lwr <- pred$fit[,2] # lower prediction limit (transformed scale)
newdata$upr <- pred$fit[,3] # upper prediction limit (transformed scale)
ylim <- range(istat$Weight^(-0.2))*c(0.95, 1.05) # compute y range
Now the predictions on transformed scale are plotted:
ggp1 <- ggplot(data=istat %>% filter(Gender=="Male"),mapping=aes(x=Height, y=Weight^(-0.2))) +
geom_point(color="darkblue") +
geom_line(data=newdata %>% filter(Gender=="Male"),
mapping=aes(x=Height, y=fit), color="mediumvioletred") +
geom_line(data=newdata %>% filter(Gender=="Male"),
mapping=aes(x=Height, y=upr), color="mediumorchid1", linetype=2) +
geom_line(data=newdata %>% filter(Gender=="Male"),
mapping=aes(x=Height, y=lwr), color="mediumorchid1", linetype=2) +
ggtitle("Male") + ylim(ylim)
ggp2 <- ggplot(data=istat %>% filter(Gender=="Female"),mapping=aes(x=Height, y=Weight^(-0.2))) +
geom_point(color="darkblue") +
geom_line(data=newdata %>% filter(Gender=="Female"),
mapping=aes(x=Height, y=fit), color="mediumvioletred") +
geom_line(data=newdata %>% filter(Gender=="Female"),
mapping=aes(x=Height, y=upr), color="mediumorchid1", linetype=2) +
geom_line(data=newdata %>% filter(Gender=="Female"),
mapping=aes(x=Height, y=lwr), color="mediumorchid1", linetype=2) +
ggtitle("Female") +ylim(ylim)
grid.arrange(ggp1, ggp2, ncol=2)
Plot of model predictions by Gender with 95% prediction intervals on transformed scale
Now we prepare some data to plot the predictions on the original scale:
ylim <- range(istat$Weight) * c(0.95, 1.05)
And then the predictions are actually plotted:
ggp1 <- ggplot(data=istat %>% filter(Gender=="Male"), mapping=aes(x=Height, y=Weight)) +
geom_point(color="darkblue") +
geom_line(data=newdata %>% filter(Gender=="Male"),
mapping=aes(x=Height, y=(fit^(-5))), color="mediumvioletred") +
geom_line(data=newdata %>% filter(Gender=="Male"),
mapping=aes(x=Height, y=(upr^(-5))), color="mediumorchid1", linetype=2) +
geom_line(data=newdata %>% filter(Gender=="Male"),
mapping=aes(x=Height, y=(lwr^(-5))), color="mediumorchid1", linetype=2) +
ggtitle("Male") + ylim(ylim)
ggp2 <- ggplot(data=istat %>% filter(Gender=="Female"), mapping=aes(x=Height, y=Weight)) +
geom_point(color="darkblue") +
geom_line(data=newdata %>% filter(Gender=="Female"),
mapping=aes(x=Height, y=(fit^(-5))), color="mediumvioletred") +
geom_line(data=newdata %>% filter(Gender=="Female"),
mapping=aes(x=Height, y=(upr^(-5))), color="mediumorchid1", linetype=2) +
geom_line(data=newdata %>% filter(Gender=="Female"),
mapping=aes(x=Height, y=(lwr^(-5))), color="mediumorchid1", linetype=2) +
ggtitle("Female") + ylim(ylim)
grid.arrange(ggp1, ggp2, ncol=2)
Plot of model predictions by Gender with 95% prediction intervals on original scale
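One detail worth noting: since \(\lambda=-\frac{1}{5}\) is negative, the back-transform \(y = t^{-5}\) is decreasing, so the lower and upper limits swap roles on the original scale. A defensive sketch (lwr_orig and upr_orig are new illustrative columns):

```r
# Sort the back-transformed limits so that lwr_orig <= upr_orig
# regardless of the sign of the power.
newdata$lwr_orig <- pmin(newdata$lwr^(-5), newdata$upr^(-5))
newdata$upr_orig <- pmax(newdata$lwr^(-5), newdata$upr^(-5))
```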
The Iowa dataset is a toy example that summarizes the yield of wheat (bushels per acre) in the state of Iowa from 1930 to 1962. Besides yield, the year, rainfall, and temperature were recorded as the main predictors of yield:
Year: Year of harvest
Rain0: Pre-season rainfall
Temp1: Mean temperature for growing month 1
Rain1: Rainfall for growing month 1
Temp2: Mean temperature for growing month 2
Rain2: Rainfall for growing month 2
Temp3: Mean temperature for growing month 3
Rain3: Rainfall for growing month 3
Temp4: Mean temperature for harvest month
Yield: Yield in bushels per acre
We want to find a first descriptive model that approximates the relation between Yield and the other variables.
data(iowheat)
str(iowheat)
## Classes 'tbl_df', 'tbl' and 'data.frame': 33 obs. of 10 variables:
## $ Year : int 1930 1931 1932 1933 1934 1935 1936 1937 1938 1939 ...
## $ Rain0: num 17.8 14.8 28 16.8 11.4 ...
## $ Temp1: num 60.2 57.5 62.3 60.5 69.5 55 66.2 61.8 59.5 66.4 ...
## $ Rain1: num 5.83 3.83 5.17 1.64 3.49 7 2.85 3.8 4.67 5.32 ...
## $ Temp2: num 69 75 72 77.8 77.2 65.9 70.1 69 69.2 71.4 ...
## $ Rain2: num 1.49 2.72 3.12 3.45 3.85 3.35 0.51 2.63 4.24 3.15 ...
## $ Temp3: num 77.9 77.2 75.8 76.4 79.7 79.4 83.4 75.9 76.5 76.2 ...
## $ Rain3: num 2.42 3.3 7.1 3.01 2.84 2.42 3.48 3.99 3.82 4.72 ...
## $ Temp4: num 74.4 72.6 72.2 70.5 73.4 73.6 79.2 77.8 75.7 70.7 ...
## $ Yield: num 34 32.9 43 40 23 38.4 20 44.6 46.3 52.2 ...
head(iowheat)
## Source: local data frame [6 x 10]
##
## Year Rain0 Temp1 Rain1 Temp2 Rain2 Temp3 Rain3 Temp4 Yield
## (int) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl)
## 1 1930 17.75 60.2 5.83 69.0 1.49 77.9 2.42 74.4 34.0
## 2 1931 14.76 57.5 3.83 75.0 2.72 77.2 3.30 72.6 32.9
## 3 1932 27.99 62.3 5.17 72.0 3.12 75.8 7.10 72.2 43.0
## 4 1933 16.76 60.5 1.64 77.8 3.45 76.4 3.01 70.5 40.0
## 5 1934 11.36 69.5 3.49 77.2 3.85 79.7 2.84 73.4 23.0
## 6 1935 22.71 55.0 7.00 65.9 3.35 79.4 2.42 73.6 38.4
We begin the analysis producing a matrix of scatterplots:
ggpairs(iowheat)
Matrix of scatterplot between dataframe variables
Some relationships seem to emerge. Let us now try to build some models.
Let us start with a model containing all two-way interactions:
fmA <- lm(Yield ~ .^2, data = iowheat)
fmA
##
## Call:
## lm(formula = Yield ~ .^2, data = iowheat)
##
## Coefficients:
## (Intercept) Year Rain0 Temp1 Rain1
## -2.446e+04 1.679e+00 3.303e+03 -1.881e+03 -1.338e+03
## Temp2 Rain2 Temp3 Rain3 Temp4
## 1.286e+03 2.580e+03 9.384e+01 -4.304e+03 2.162e+02
## Year:Rain0 Year:Temp1 Year:Rain1 Year:Temp2 Year:Rain2
## -1.328e+00 1.038e+00 7.515e-01 -6.253e-01 -8.104e-01
## Year:Temp3 Year:Rain3 Year:Temp4 Rain0:Temp1 Rain0:Rain1
## -6.729e-03 2.140e+00 -6.895e-02 6.029e-01 -6.329e+00
## Rain0:Temp2 Rain0:Rain2 Rain0:Temp3 Rain0:Rain3 Rain0:Temp4
## -2.215e+00 -6.736e+00 -4.389e+00 -2.576e+00 -2.796e+00
## Temp1:Rain1 Temp1:Temp2 Temp1:Rain2 Temp1:Temp3 Temp1:Rain3
## -4.033e+00 -7.335e-01 -1.248e+01 -1.054e-03 3.010e+00
## Temp1:Temp4 Rain1:Temp2 Rain1:Rain2 Rain1:Temp3 Rain1:Rain3
## -5.905e-01 4.537e+00 -2.260e+01 NA NA
## Rain1:Temp4 Temp2:Rain2 Temp2:Temp3 Temp2:Rain3 Temp2:Temp4
## NA NA NA NA NA
## Rain2:Temp3 Rain2:Rain3 Rain2:Temp4 Temp3:Rain3 Temp3:Temp4
## NA NA NA NA NA
## Rain3:Temp4
## NA
Unfortunately, there are more parameters than observations:
dim(iowheat)
## [1] 33 10
so not all of the parameters can be estimated (note the NA coefficients above).
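The parameter count confirms the problem: with 9 predictors, the full two-way interaction model needs more coefficients than the 33 available observations.

```r
# 1 intercept + 9 main effects + choose(9, 2) = 36 two-way interactions.
1 + 9 + choose(9, 2)  # 46 parameters for 33 observations
```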
We then try a second model that uses main effects only:
fmA <- lm(Yield ~ ., data = iowheat)
summary(fmA)
##
## Call:
## lm(formula = Yield ~ ., data = iowheat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.3901 -4.3086 0.5322 4.6877 16.7007
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.637e+03 4.010e+02 -4.083 0.000458 ***
## Year 8.756e-01 1.879e-01 4.660 0.000109 ***
## Rain0 7.845e-01 4.297e-01 1.826 0.080919 .
## Temp1 -4.598e-01 4.290e-01 -1.072 0.294953
## Rain1 -7.794e-01 1.058e+00 -0.737 0.468692
## Temp2 4.822e-01 5.734e-01 0.841 0.409048
## Rain2 2.557e+00 1.382e+00 1.851 0.077099 .
## Temp3 4.891e-02 7.212e-01 0.068 0.946518
## Rain3 4.078e-01 1.033e+00 0.395 0.696554
## Temp4 -6.563e-01 6.698e-01 -0.980 0.337382
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.815 on 23 degrees of freedom
## Multiple R-squared: 0.7476, Adjusted R-squared: 0.6488
## F-statistic: 7.568 on 9 and 23 DF, p-value: 4.354e-05
Now let us try automatic model selection procedures.
In a first attempt, a backward (“lower”) stepwise procedure is used. Starting from an (almost) complete model, it removes terms one at a time, at each step dropping the term whose removal most reduces the AIC; a removed term may later be re-inserted if doing so improves the AIC.
In this case, the model must always include Year as an explanatory variable (see the scope parameter).
lower_step_aic_fm <- stepAIC(fmA, scope = list(lower = ~Year))
## Start: AIC=143.79
## Yield ~ Year + Rain0 + Temp1 + Rain1 + Temp2 + Rain2 + Temp3 +
## Rain3 + Temp4
##
## Df Sum of Sq RSS AIC
## - Temp3 1 0.281 1405.1 141.79
## - Rain3 1 9.525 1414.4 142.01
## - Rain1 1 33.159 1438.0 142.56
## - Temp2 1 43.193 1448.0 142.79
## - Temp4 1 58.637 1463.5 143.14
## - Temp1 1 70.159 1475.0 143.40
## <none> 1404.8 143.79
## - Rain0 1 203.575 1608.4 146.25
## - Rain2 1 209.204 1614.0 146.37
##
## Step: AIC=141.8
## Yield ~ Year + Rain0 + Temp1 + Rain1 + Temp2 + Rain2 + Rain3 +
## Temp4
##
## Df Sum of Sq RSS AIC
## - Rain3 1 9.876 1415.0 140.03
## - Rain1 1 37.336 1442.5 140.66
## - Temp2 1 43.066 1448.2 140.79
## - Temp4 1 61.843 1467.0 141.22
## - Temp1 1 69.964 1475.1 141.40
## <none> 1405.1 141.79
## - Rain0 1 221.747 1626.9 144.63
## - Rain2 1 231.803 1636.9 144.83
##
## Step: AIC=140.03
## Yield ~ Year + Rain0 + Temp1 + Rain1 + Temp2 + Rain2 + Temp4
##
## Df Sum of Sq RSS AIC
## - Rain1 1 41.234 1456.2 138.97
## - Temp2 1 54.232 1469.2 139.27
## - Temp1 1 70.039 1485.0 139.62
## - Temp4 1 87.976 1503.0 140.02
## <none> 1415.0 140.03
## - Rain2 1 222.363 1637.4 142.84
## - Rain0 1 286.500 1701.5 144.11
##
## Step: AIC=138.97
## Yield ~ Year + Rain0 + Temp1 + Temp2 + Rain2 + Temp4
##
## Df Sum of Sq RSS AIC
## - Temp1 1 46.810 1503.0 138.02
## - Temp4 1 78.535 1534.8 138.71
## - Temp2 1 82.223 1538.5 138.79
## <none> 1456.2 138.97
## - Rain0 1 255.372 1711.6 142.31
## - Rain2 1 300.665 1756.9 143.17
##
## Step: AIC=138.02
## Yield ~ Year + Rain0 + Temp2 + Rain2 + Temp4
##
## Df Sum of Sq RSS AIC
## - Temp2 1 51.562 1554.6 137.13
## <none> 1503.0 138.02
## - Temp4 1 126.714 1629.8 138.69
## - Rain0 1 247.570 1750.6 141.05
## - Rain2 1 263.348 1766.4 141.35
##
## Step: AIC=137.13
## Yield ~ Year + Rain0 + Rain2 + Temp4
##
## Df Sum of Sq RSS AIC
## <none> 1554.6 137.13
## - Temp4 1 187.95 1742.6 138.90
## - Rain0 1 196.01 1750.6 139.05
## - Rain2 1 240.20 1794.8 139.87
summary(lower_step_aic_fm)
##
## Call:
## lm(formula = Yield ~ Year + Rain0 + Rain2 + Temp4, data = iowheat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.7088 -3.5819 0.7129 4.4983 11.9795
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1556.1763 288.5405 -5.393 9.47e-06 ***
## Year 0.8519 0.1498 5.688 4.25e-06 ***
## Rain0 0.5989 0.3187 1.879 0.0707 .
## Rain2 2.3613 1.1353 2.080 0.0468 *
## Temp4 -0.9755 0.5302 -1.840 0.0764 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.451 on 28 degrees of freedom
## Multiple R-squared: 0.7206, Adjusted R-squared: 0.6807
## F-statistic: 18.06 on 4 and 28 DF, p-value: 1.954e-07
The final model has the terms: Year, Rain0, Rain2, and Temp4.
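As an aside, the same backward search can be run with the BIC penalty, which favours smaller models; a sketch using the k argument of stepAIC:

```r
# k = log(n) turns the AIC penalty into BIC; trace = FALSE silences
# the step-by-step output.
step_bic_fm <- stepAIC(fmA, scope = list(lower = ~Year),
                       k = log(nrow(iowheat)), trace = FALSE)
formula(step_bic_fm)
```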
A second attempt uses a forward (“upper”) stepwise procedure, which adds terms one at a time, at each step choosing the addition that most reduces the AIC; an added term may later be removed if doing so improves the AIC. Here the search starts from the model selected above, and the maximal model contains all two-way interactions (see the scope parameter again):
upper_step_aic_fm <- stepAIC(lower_step_aic_fm, scope = list(upper = ~.^2))
## Start: AIC=137.13
## Yield ~ Year + Rain0 + Rain2 + Temp4
##
## Df Sum of Sq RSS AIC
## + Rain2:Temp4 1 294.61 1260.0 132.20
## <none> 1554.6 137.13
## + Year:Temp4 1 32.48 1522.1 138.44
## + Rain0:Rain2 1 17.17 1537.4 138.76
## - Temp4 1 187.95 1742.6 138.90
## + Year:Rain0 1 8.65 1546.0 138.95
## - Rain0 1 196.01 1750.6 139.05
## + Year:Rain2 1 1.91 1552.7 139.09
## + Rain0:Temp4 1 0.98 1553.6 139.11
## - Rain2 1 240.20 1794.8 139.87
## - Year 1 1796.22 3350.8 160.47
##
## Step: AIC=132.2
## Yield ~ Year + Rain0 + Rain2 + Temp4 + Rain2:Temp4
##
## Df Sum of Sq RSS AIC
## <none> 1260.0 132.20
## + Year:Rain2 1 30.43 1229.6 133.39
## + Rain0:Temp4 1 19.81 1240.2 133.68
## + Rain0:Rain2 1 18.49 1241.5 133.71
## + Year:Temp4 1 1.74 1258.2 134.15
## + Year:Rain0 1 1.00 1259.0 134.17
## - Rain0 1 224.37 1484.4 135.61
## - Rain2:Temp4 1 294.61 1554.6 137.13
## - Year 1 1837.91 3097.9 159.88
summary(upper_step_aic_fm)
##
## Call:
## lm(formula = Yield ~ Year + Rain0 + Rain2 + Temp4 + Rain2:Temp4,
## data = iowheat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.8002 -4.4060 0.5752 5.0040 11.1749
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1394.7481 272.2220 -5.124 2.19e-05 ***
## Year 0.8621 0.1374 6.276 1.03e-06 ***
## Rain0 0.6418 0.2927 2.193 0.03712 *
## Rain2 -63.5971 26.2717 -2.421 0.02249 *
## Temp4 -3.4277 1.0903 -3.144 0.00403 **
## Rain2:Temp4 0.8963 0.3567 2.513 0.01826 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.831 on 27 degrees of freedom
## Multiple R-squared: 0.7736, Adjusted R-squared: 0.7317
## F-statistic: 18.45 on 5 and 27 DF, p-value: 5.839e-08
The final model has the terms: Year, Rain0, Rain2, Temp4, Rain2:Temp4. It is slightly more complex than the previous one.
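Since the backward-selected model is nested in the forward-selected one, an F-test can check whether the extra Rain2:Temp4 term is worthwhile; a sketch:

```r
# Compare the two automatically selected models.
anova(lower_step_aic_fm, upper_step_aic_fm)
```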
The model arising from the forward stepwise search is taken as the final one; the residual analysis follows:
op <- par(mfrow = c(2,2))
plot(upper_step_aic_fm)
par(op)
Residual plot of model